Abstract

Protein identification is one of the major tasks of
proteomics researchers. It can be summarized as searching
for the best match between an experimental mass spectrum
and the proteins of a database. However, this approach
cannot be used to identify new proteins or protein variants.
In this paper, an evolutionary approach is proposed to
discover new proteins or protein variants by means of a
"de novo sequencing" method. This approach has been tested
on a grid called Grid5000 with both simulated and real
spectra.

1. Introduction 

Proteomics can be defined as the global analysis of
proteins. Protein identification is one of the major tasks
of proteomics researchers, as it helps to understand the
biological mechanisms in living cells. All current methods
use data from mass spectrometers and generally give good
results. However, in the case of protein variants or new
proteins, these methods can only recognize a protein if it
is stored in a database, and they cannot clearly explain why
this protein differs from the others in the database. The
aim of our approach is to find the entire sequence of a
protein, even in the case of variants or unknown proteins.
To do so, we need to identify the different peptides that
compose the protein. First, their masses (their chemical
formulas) have to be found from an MS spectrum; second, from
their masses, their sequences can be found from MS/MS
spectra. Once the peptides are known, the complete protein
can be obtained.

This article is organized as follows. Section 2 deals with
the specificities of the protein variant and new protein
identification problems; Section 3 describes our approach
and the different algorithms that compose it; Section 4
introduces the parallel framework; Section 5 presents and
discusses our results; finally, Section 6 provides
conclusions and perspectives about this work.

2. The Positioning of the Protein Variants and 
New Proteins Identification Problem 

The identification of new proteins and protein variants
is a complex problem. All existing protein identification
methods are based on two types of data: MS and MS/MS spectra
(MS for Mass Spectrometry), which are mass/intensity
spectra. An MS spectrum is obtained by extracting an
experimental protein from a protein mixture, digesting it
with a specific enzyme and analyzing it in a mass
spectrometer. From an MS spectrum, databases make it
possible to identify all the peptides by their masses.
Techniques that use MS spectra for protein identification
are called peptide mass fingerprint (PMF) methods. Their
scoring is based on the comparison of an experimental
peptide mass list with a theoretical peptide mass list
[5, 11]. They give good results, but they only find the
protein closest to the experimental one, without further
information. A way to overcome the limitations of MS data is
to also use MS/MS data (tandem mass spectrometry). Each
peptide from the MS spectrum is selected and fragmented to
obtain the corresponding MS/MS spectrum. The detected ions
are characteristic of the structure of the parent peptide.
Thus it is theoretically possible to obtain the sequence of
each peptide of the digested protein. Using MS data (masses
of the peptides) combined with MS/MS data (partial sequences
of the peptides) increases the accuracy of the PMF
techniques [1, 9]. These scores use several properties of
the ions obtained from MS/MS spectra in order to find amino
acid sequences. With partial amino acid sequences and
masses, proteins can be distinguished more easily than with
masses only. However, this is not sufficient to identify
unknown proteins.

An alternative method, named de novo sequencing, has been
proposed, using tandem mass spectrometry. It works on random
protein sequences in order to find the experimental one
(without databases). In this case the identification is
based on random peptides or on peptides resulting from an
earlier identification (made by specific tools)
[3, 4, 10, 13]. However, the MS/MS data are so fragmented
(the deduced sequences are limited) and the number of
theoretical proteins that can be generated is so large that
this kind of technique is only used on small amounts of
data; this is called de novo peptide sequencing.
Furthermore, alignment tools such as BLAST are necessary to
find the closest peptide corresponding to the resulting
sequence and to validate it.

Evolutionary approaches have already been used as
optimization methods to cope with the huge search space of
the de novo peptide sequencing problem [7, 10] and give
interesting results. We have therefore decided to design a
genetic algorithm to perform de novo protein sequencing.

3. General Approach 

Given the available data, the number of possible amino
acid sequences is too large to be enumerated. A genetic
algorithm (GA) has therefore been chosen for its ability to
explore large solution spaces.

Finding protein sequences requires two complementary
steps: finding the right peptide masses from the MS spectrum
and, from them, finding the corresponding sequences from
MS/MS spectra. The first step can be described as follows:
the individuals (randomly initialized) are digested
(theoretical digestion) to put them in the form of a peptide
list and, thanks to our evaluation function, our GA can
generate individuals that correspond to the right peptide
mass list. We now detail each of these parts.
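The theoretical digestion mentioned above can be sketched as follows. This is
only an illustrative sketch, not the authors' implementation: it assumes a
trypsin-like cleavage rule (cut after K or R unless the next residue is P),
whereas the paper only speaks of "the grammar of the enzyme chosen", and the
names `cleavage_sites` and `digest` are hypothetical. Peptides are produced
first without missed cleavages, then level by level with one more skipped
cleavage site per level, as described in Section 3.1.

```python
import re

# Assumed trypsin-like rule: cut after K or R, except when followed by P.
TRYPSIN_LIKE = r"(?<=[KR])(?!P)"


def cleavage_sites(protein, pattern=TRYPSIN_LIKE):
    """Return the internal cut positions allowed by the enzyme rule."""
    return [m.start() for m in re.finditer(pattern, protein)
            if 0 < m.start() < len(protein)]


def digest(protein, max_missed=0):
    """Return (peptide, missed_cleavages) pairs, level by level."""
    cuts = [0] + cleavage_sites(protein) + [len(protein)]
    peptides = []
    for missed in range(max_missed + 1):
        # A peptide with `missed` missed cleavages spans missed + 1 segments.
        for i in range(len(cuts) - 1 - missed):
            peptides.append((protein[cuts[i]:cuts[i + 1 + missed]], missed))
    return peptides


if __name__ == "__main__":
    for peptide, missed in digest("AKRPGKC", max_missed=1):
        print(missed, peptide)
```

The number of peptides grows with each allowed level of missed cleavage, which
is why the search space increases so quickly.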



3.1. Digestion Process

The digestion process corresponds to the cleavage of a
protein into smaller fragments called peptides. The cleavage
points in the protein depend on the digestion enzyme used,
because each enzyme has its own cleavage grammar. Given the
chosen enzyme, the list of potential peptides is easily
obtained. However, in the real process, the enzyme can miss
some cleavage points; these are called missed cleavages. The
number of potential peptides is therefore greatly increased
by these missed cleavages. We developed a linear, iterative
algorithm that performs the theoretical digestion according
to the grammar of the chosen enzyme and the number of missed
cleavages allowed. Our algorithm works in two stages: first,
the peptides are computed without missed cleavages; then,
level by level, the number of missed cleavages is increased
up to the desired value.

The digestion process is an essential algorithm for
proteomics approaches. In the next section, we present the
optimization method.

3.2. The genetic algorithm (GA)

A genetic algorithm (GA) works by repeatedly modifying a
population of artificial structures through the application
of genetic operators (crossover and mutation) [8]. The goal
is to find the best possible solution or, at least, good
solutions to the problem. Figure 1 shows the global scheme
of a genetic algorithm. Our GA has been developed with the
ParadisEO platform, a C++ GPL (General Public Licence)
platform for the design of evolutionary algorithms [2]. It
may allow us to find the right peptide mass list
corresponding to an MS spectrum.

Figure 1. The general flowchart of a genetic algorithm;
p is the probability of mutation.



• Individuals Representation: an individual is represented
as a list of peptides, for three reasons: each individual
is digested only once, during the initialization process;
the original sequence can easily be recomputed; and the
evaluation function and the fragmentation process need
the proteins to be in the form of a peptide list. In
detail, an individual is a list of peptides (each with its
number of missed cleavages), each peptide is a chain of
amino acids, and each amino acid can carry
post-translational modifications.



• Evaluation Function: it is a completely original
evaluation function based on an optimized version of the
algorithm developed by A. L. Rockwood [12] to compute
isotopic distributions. The main advantage of our function
is that it directly compares an experimental MS spectrum
with a simulated one; it does not need the monoisotopic
mass list extracted from the experimental MS spectrum. An
individual of our GA is translated into a list of chemical
formulas. For each chemical formula (that is, for each
peptide), the isotopic distribution is computed and,
peptide by peptide, the simulated spectrum is built. The
evaluation function computes the correlation between each
theoretical peptide and the experimental spectrum. The
partial scores of all the theoretical peptides make up the
fitness (the score) of an individual. However, the
evaluation function is time-consuming: a protein of 500
amino acids takes one second on average to evaluate.

This evaluation function has been validated by
searching for known proteins in databases. For this
validation, we use the UNIPROT database in FASTA format,
which can be downloaded from
www.expasy.uniprot.org/database/download.shtml.

• Individuals Initialization: this process follows a de
novo sequencing approach. Individuals are randomly
generated with a variable length (in amino acids). During
the evolution of the GA, the size of the individuals
changes through the mutation operators (peptide
insertion/deletion, amino acid insertion/deletion, amino
acid substitution and post-translational modification).
Random generation provides a high population diversity at
the beginning of the search.

• Operators: they allow a diversified and intensified
search. In a GA, there are two types of operators: the
crossover operator and the mutation operator. The
crossover operator generates "children" individuals from
"parent" individuals. In our case, we use the well-known
1-point crossover operator.

The mutation operator maintains genetic diversity in the
new individuals. The individuals generated by crossover
can undergo an additional mutation. In our GA, there are
6 types of mutation: random peptide insertion, random
peptide deletion, random amino acid insertion, random
amino acid deletion, amino acid substitution according to
a probability taken from a substitution matrix (by
default, the BLOSUM62 matrix [6]) and post-translational
modification. The different mutations have an equal
probability of being selected. All these operators allow
the GA to stay very close to the real biological model.
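The representation, initialization and operators described above can be
sketched as follows. This is a minimal illustration under several assumptions,
not the ParadisEO implementation: the class `Peptide`, the constant `N_PTMS`
and all function names are hypothetical, the substitution is drawn uniformly
instead of from the BLOSUM62 matrix, and post-translational modifications are
encoded as one 0/1 flag per allowed modification on each peptide.

```python
import random
from dataclasses import dataclass, replace

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_PTMS = 3  # hypothetical number of allowed post-translational modifications


@dataclass(frozen=True)
class Peptide:
    sequence: str
    missed_cleavages: int = 0
    ptm_flags: tuple = (0,) * N_PTMS  # each allowed PTM is off (0) or on (1)


def random_peptide(rng, max_len=12):
    length = rng.randint(1, max_len)
    return Peptide("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))


def random_individual(rng, n_peptides=5):
    """An individual is a list of peptides (random de novo initialization)."""
    return [random_peptide(rng) for _ in range(n_peptides)]


def one_point_crossover(a, b, rng):
    """Swap the tails of two peptide lists at a single cut point."""
    if min(len(a), len(b)) < 2:
        return list(a), list(b)
    cut = rng.randint(1, min(len(a), len(b)) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(individual, rng):
    """Apply one of six mutation types, chosen with equal probability."""
    ind = list(individual)  # work on a copy; the parent is left untouched
    i = rng.randrange(len(ind))
    pep = ind[i]
    pos = rng.randrange(len(pep.sequence))
    kind = rng.choice(["pep_ins", "pep_del", "aa_ins", "aa_del", "aa_sub", "ptm"])
    if kind == "pep_ins":
        ind.insert(i, random_peptide(rng))
    elif kind == "pep_del":
        if len(ind) > 1:  # never delete the last peptide
            del ind[i]
    elif kind == "aa_ins":
        seq = pep.sequence[:pos] + rng.choice(AMINO_ACIDS) + pep.sequence[pos:]
        ind[i] = replace(pep, sequence=seq)
    elif kind == "aa_del":
        if len(pep.sequence) > 1:  # never produce an empty peptide
            ind[i] = replace(pep, sequence=pep.sequence[:pos] + pep.sequence[pos + 1:])
    elif kind == "aa_sub":
        # Uniform choice here; the paper weights it by a substitution matrix.
        seq = pep.sequence[:pos] + rng.choice(AMINO_ACIDS) + pep.sequence[pos + 1:]
        ind[i] = replace(pep, sequence=seq)
    else:
        k = rng.randrange(N_PTMS)  # toggle one PTM flag
        flags = tuple(f ^ 1 if j == k else f for j, f in enumerate(pep.ptm_flags))
        ind[i] = replace(pep, ptm_flags=flags)
    return ind
```

Keeping `Peptide` immutable means that children produced by crossover can share
peptide objects with their parents without a later mutation altering both.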



4. A parallel GA 

As noted previously, the scoring function is
time-consuming. The GA was developed with ParadisEO [2].
ParadisEO is one of the rare frameworks that provide the
most common parallel and distributed models. These models
concern the island-based running of metaheuristics, the
evaluation of a population, and the evaluation of a single
solution. They are portable across distributed-memory
machines and shared-memory multiprocessors, as they are
implemented using standard libraries such as MPI, PVM and
PThreads. The models can be exploited transparently: one
just has to instantiate the associated ParadisEO components.

4.1. Model 

As our scoring function is time-consuming, we decided to
parallelize the GA by evaluating several individuals
simultaneously. The model used is a master/slave one. The
master sends individuals to the slaves for evaluation, and
the slaves send back the fitness values. The system is fault
tolerant: the master can detect when a slave is available
and send it an individual by means of a dispatcher.
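The master/slave evaluation can be illustrated with the following sketch. It
is not the ParadisEO model itself: here a Python thread pool stands in for the
slaves (a real CPU-bound deployment would use separate processes or MPI), the
function name `evaluate_population` and the parameter `max_retries` are
hypothetical, and fault tolerance is reduced to re-dispatching an individual
whose evaluation failed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate_population(individuals, fitness, n_slaves=4, max_retries=2):
    """Master side: dispatch individuals to workers and collect the fitness values."""
    results = [None] * len(individuals)
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        # Map each submitted task to (index of the individual, retries so far).
        pending = {pool.submit(fitness, ind): (i, 0)
                   for i, ind in enumerate(individuals)}
        while pending:
            for future in as_completed(list(pending)):
                i, tries = pending.pop(future)
                try:
                    results[i] = future.result()
                except Exception:
                    if tries >= max_retries:
                        raise
                    # Re-dispatch the individual, as if its slave had been lost.
                    pending[pool.submit(fitness, individuals[i])] = (i, tries + 1)
    return results


if __name__ == "__main__":
    print(evaluate_population([1, 2, 3, 4, 5], lambda x: x * x))
```

The pool hands a new task to whichever worker becomes idle, which plays the
role of the dispatcher; the results are returned in the order of the input
population regardless of completion order.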

4.2. Infrastructure 

We decided to deploy our project on a grid. Grid
computing uses the resources of many separate computers
connected by a network (usually the Internet) to solve
large-scale computation problems. We use Grid5000
(www.grid5000.org) resources for our application. Grid5000
has resources located in Lille, Paris-Orsay, Rennes,
Bordeaux, Toulouse, Lyon, Grenoble and Sophia Antipolis.
Grid5000 uses the Renater network (the French national
network for research and education), whose speed is
2.5 Gbit/s, to connect the different sites.

5. Results

In this section, we present the results of the GA and of
its parallelization. For all our experiments, the parameters
of our GA were set to 100 for the population size, 0.9 for
the crossover rate (the most commonly used value) and 0.6
for the mutation rate (an experimental value giving the best
convergence speed).
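To show how these three parameters enter a GA, here is a generic loop run on a
toy problem. It is a sketch only: the paper does not state the selection or
replacement scheme, so binary tournament selection and elitism are assumptions,
and `run_ga` and the `toy_*` functions (maximize the number of "A" characters
in a string) are hypothetical stand-ins for the protein-specific components.

```python
import random


def run_ga(population, fitness, crossover, mutate, generations,
           crossover_rate=0.9, mutation_rate=0.6, rng=None):
    """Generic GA loop; returns the best individual of the last generation."""
    rng = rng or random.Random()
    pop = list(population)
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)]  # elitism (assumption)
        while len(next_pop) < len(pop):
            # Binary tournament selection (assumption, not stated in the paper).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            if rng.random() < crossover_rate:
                c1, c2 = crossover(p1, p2, rng)
            else:
                c1, c2 = p1, p2
            for child in (c1, c2):
                if rng.random() < mutation_rate:
                    child = mutate(child, rng)
                next_pop.append(child)
        pop = next_pop[:len(pop)]  # keep the population size constant
    return max(pop, key=fitness)


# Toy problem used only to exercise the loop.
def toy_fitness(s):
    return s.count("A")


def toy_crossover(a, b, rng):
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def toy_mutate(s, rng):
    i = rng.randrange(len(s))
    return s[:i] + rng.choice("AB") + s[i + 1:]


rng = random.Random(42)
initial = ["".join(rng.choice("AB") for _ in range(20)) for _ in range(100)]
best = run_ga(initial, toy_fitness, toy_crossover, toy_mutate,
              generations=30, rng=rng)
```

With elitism, the best fitness can never decrease from one generation to the
next, so `best` is at least as good as the best individual of `initial`.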

5.1. Biological validation 

In order to validate our first results, we compare the
spectrum of our best individual with the simulated spectrum
of the Apo-AI protein.

Figure 2. Apo-AI simulated spectrum vs best individual
spectrum.

Figure 2 shows a good correlation between the Apo-AI
spectrum and the spectrum of the best individual of our GA.
The most important value is the mass because, for the
moment, all the simulated spectra that we generate have an
intensity normalized to 1 (a higher intensity only indicates
that more than one peptide has the same mass).
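The kind of computation behind such a comparison (an isotopic distribution
derived from a chemical formula, then a correlation with observed intensities)
can be sketched as follows. This is not the authors' evaluation function: a
plain convolution stands in for the optimized algorithm of [12], only nominal
mass offsets are considered, the isotope abundances are rounded natural
abundances, and the function names and the `max_peaks` parameter are
hypothetical.

```python
import math

# Rounded natural isotope abundances, indexed by nominal mass offset (+0, +1, ...).
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425, 0.0, 0.0001],
}


def convolve(a, b):
    """Discrete convolution of two abundance lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def isotopic_distribution(formula, max_peaks=6):
    """Relative abundances of the M, M+1, ... peaks for {element: count}."""
    dist = [1.0]
    for element, count in formula.items():
        for _ in range(count):
            # Truncation drops only the (tiny) probability of far heavier peaks.
            dist = convolve(dist, ISOTOPES[element])[:max_peaks]
    return dist


def pearson(x, y):
    """Pearson correlation between two equally long intensity vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


if __name__ == "__main__":
    # Hypothetical peptide formula, for illustration only.
    pattern = isotopic_distribution({"C": 50, "H": 80, "N": 14, "O": 15})
    print([round(p, 4) for p in pattern])
```

A score for one peptide could then be the correlation between its simulated
pattern and the experimental intensities read at the corresponding masses.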
Table 1. Results obtained for several types of data (Sim
for simulated data, Exp for experimental data, M for
matches).

Furthermore, Table 1 shows the results obtained with
different types of data: simulated spectra computed from
sequences in FASTA format and experimental spectra from a
mass spectrometer. In this table, we note that the results
on simulated data are better than the results on
experimental data. This is due to the convergence speed of
the GA on simulated data. We therefore need to adapt the GA
engine (specifically the number of generations and the
stopping criterion) to the specificities of the data. This
will be possible once the second step of our approach is
completely defined.

Table 2 shows that the first step of our approach is
achieved, since we find (globally) the right peptide masses.
Although we have the correct chemical formulas, we do not
necessarily have the right peptide sequences. However, from
the correct chemical formula and an MS/MS spectrum, we can
extend the evolution of the GA towards the right peptide
sequence and thus the right protein sequence (the second
step of our approach).

Table 2. Matching Apo-AI peptides (AAI pep) and best
individual peptides. δ is the mass difference,
δ = (Apo-AI peptide - best individual peptide) / Apo-AI
peptide. There are also 11 exact sequence matches, which are
not shown here.

5.2. GA robustness 

A genetic algorithm is a stochastic algorithm, and a given
execution does not always lead to the optimal solution.

Table 3. Statistics according to the data used. AAI: Apo-AI
Human, CC: Cyt-C Bovine, AC: Albumin Chicken. S/E:
simulated/experimental spectrum, σ: standard deviation.

To study the behavior of the GA, we performed 15
experiments (runs of the GA) for each protein. Table 3
summarizes some statistics over the experiments: the optimal
fitness (in the case of experimental data, no protein
matches the spectrum exactly, so no value is given), the
fitness of the best individual, and the mean, median and
standard deviation of the fitness of the solutions.

Overall, our GA is quite robust on all the data, as the
median and the mean are very similar. We note, however, that
the GA still needs to be improved in order to reach the
optimal value on every run.
5.3. Parallel version

We tested our parallel version on experimental proteins.
Let $T_s$ be the time taken to run the fastest serial
algorithm on one processor and $T_p$ the time taken by a
parallel algorithm on $N$ processors. To measure the gain of
the parallelization, we compute two measures: the speed-up
$S_N = T_s / T_p$ and the efficiency $\mathrm{Eff} = S_N / N$.
Table 4. Execution time (in sec), Speed-up ($S_N$) and
Efficiency (Eff) on Grid5000 for 2 experimental spectra
according to the number of processors (Nb Proc).



Table 4 summarizes the values of the two measures for the
Apo-AI and Cytc proteins. For Apo-AI, the efficiency is less
than 1 for any number of processors and is very poor for
more than 32 processors, whereas for Cytc we observe
superlinear performance. Possible reasons for these
observations are load balancing (Grid5000 is a heterogeneous
grid), communication overhead, or potentially volatile nodes
(Grid5000 is composed of PC clusters from universities that
could be used by students or be turned off).
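For reference, the two measures can be computed as in the sketch below, which
uses the usual definitions (speed-up = serial time / parallel time, efficiency
= speed-up / number of processors). The timings in the example are made-up
values for illustration and are not taken from Table 4; the function names are
hypothetical.

```python
def speed_up(t_serial, t_parallel):
    """Speed-up: serial time divided by parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_processors):
    """Efficiency: speed-up divided by the number of processors."""
    return speed_up(t_serial, t_parallel) / n_processors


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT measurements from the paper.
    t_s = 100.0
    for n, t_p in [(2, 60.0), (4, 25.0), (8, 25.0)]:
        print(n, speed_up(t_s, t_p), efficiency(t_s, t_p, n))
```

An efficiency of exactly 1 corresponds to linear speed-up; a value above 1
would correspond to the superlinear behavior mentioned for Cytc.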

6. Conclusions and Perspectives 

In this article, a genetic algorithm has been proposed to
discover the sequence of an experimental protein. We have
explained the limits of the current methods and the benefits
of a GA. The novelties of our approach are our evaluation
function and the application of a de novo sequencing method
to complete proteins, and not only to small peptides.
Furthermore, we have tested a parallel version of our GA on
a grid. A lot of work remains to increase the potential of
the approach and the performance of the GA in order to find
the right peptide sequences that compose the experimental
protein. We will continue to work on our evaluation function
in order to design new ones and to combine some of them to
obtain better-quality solutions.



References

[1] V. Bafna and N. Edwards. Scope: A probabilistic model
for scoring tandem mass spectra against a peptide database.
Bioinformatics, 1(1):1-9, 2001.

[2] S. Cahon, N. Melab, and E.-G. Talbi. ParadisEO: A Frame- 
work for the Reusable Design of Parallel and Distributed 
Metaheuristics. Journal of Heuristics, 10(3):357-380, May 
2004. 

[3] V. Dancik, T. Addona, and K. Clauser. De novo peptide
sequencing via tandem mass spectrometry. Journal of
Computational Biology, 6(3/4):327-342, 1999.

[4] A. Frank and P. Pevzner. Pepnovo: de novo peptide se- 
quencing via probabilistic network. Analytical Chemistry, 
77:964-973, 2005. 

[5] R. Gras, M. Müller, E. Gasteiger, P. B. S. Gay, W. Bienvenut,
C. Hoogland, J. Sanchez, A. Bairoch, D. Hochstrasser,
and R. Appel. Improving protein identification from peptide
mass fingerprinting through a parameterized multi-level
scoring algorithm and an optimized peak detection.
Electrophoresis, 20:3535-3550, 1999.

[6] S. Henikoff and J. Henikoff. Amino acid substitution ma- 
trices from protein blocks. Proceedings of the National 
Academy of Sciences, 89:10915-10919, 1992. 

[7] A. Heredia-Langner, W. Cannon, K. Jarman, and K. Jar- 
man. Sequence optimization as an alternative to de novo 
analysis of tandem mass spectrometry data. Bioinformatics, 
20(14):2296-2304, 2004. 

[8] J. Holland. Adaptation in Natural and Artificial Systems. 
University of Michigan Press, 1975. 

[9] J. Magnin, A. Masselot, C. Menzel, and J. Colinge. OLAV- 
PMF: a novel scoring scheme for high-throughput peptide 
mass fingerprinting. Journal of Proteome Research, 3:55- 
60, 2004. 

[10] J. Marlard, A. Heredia-Langner, D. B., K. H. Jarman, and
W. Cannon. Constrained de novo peptide identification via
multi-objective optimization. In International Parallel and
Distributed Processing Symposium, page 191a, 2004.

[11] F. Monigatti and P. Berndt. Algorithm for accurate similarity
measurements of peptide mass fingerprints and its application.
Journal of the American Society for Mass Spectrometry,
16:13-21, 2005.

[12] A. Rockwood, S. V. Orden, and R. Smith. Rapid Calculation
of Isotope Distributions. Analytical Chemistry, 67(15):2698-
2704, 1995.

[13] B. Searle, S. Dasari, M. Turner, A. Reddy, D. Choi, 
P. Wilmarth, A. McCormack, L. David, and S. Nagalla. 
High-throughput identification of proteins and unanticipated 
sequence modifications using a mass-based alignment algo- 
rithm for MS/MS de novo sequencing results. Analytical 
Chemistry, 76:2220-2230, 2004. 



[Figure residue. Recoverable labels: axes "Time (seconds)" and
"Protein length (amino acids)"; "Child 1" and "Child 2"; "Master",
"Slaves", "Individuals" and "Random proteins"; note: each allowed
post-translational modification can be activated or not (0 or 1).]



